Olist E-commerce Customer Segmentation Project¶
Context¶
This project involves analyzing Olist's e-commerce data to develop customer segmentation and key performance indicators (KPIs) for a Customer Experience Dashboard. The goal is to extract actionable insights for marketing campaigns and customer behavior analysis.
Project Objectives¶
- Implement SQL queries to extract data for the Customer Experience Dashboard.
- Develop customer segmentation to support targeted marketing campaigns.
- Analyze customer behavior patterns to identify trends and opportunities.
# Configure Plotly to properly render in HTML exports
import plotly.io as pio
import plotly.graph_objects as go
import os
# Set the renderer for notebook display
pio.renderers.default = "notebook"
# Optional export settings (left disabled): embed plotly.js and MathJax via CDN
# pio.write_html_config.include_plotlyjs = 'cdn'
# pio.write_html_config.include_mathjax = 'cdn'
# pio.write_html_config.full_html = True
# Configure global theme for consistent appearance
pio.templates.default = "plotly_white"
1. Data Exploration¶
1.1 Database Connection Setup¶
- Objective: Establish a connection to the SQLite database and verify the available tables.
- Steps:
- Initialize the database connection.
- List all available tables in the dataset.
- Preview the first few rows of each table to understand its structure.
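The `DatabaseConnection` class used below is project code (`src.scripts.sqlite_connector`); as a rough sketch of what such a wrapper might look like — a hypothetical implementation assuming only `sqlite3` and pandas:

```python
import sqlite3

import pandas as pd


class DatabaseConnection:
    """Hypothetical sketch of the project helper: a thin sqlite3 + pandas wrapper."""

    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path)

    def get_table_names(self) -> list:
        # sqlite_master lists every table in the database file
        rows = self.conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
        return [name for (name,) in rows]

    def read_table(self, table: str) -> pd.DataFrame:
        return pd.read_sql_query(f"SELECT * FROM {table}", self.conn)

    def execute_query(self, query: str) -> pd.DataFrame:
        return pd.read_sql_query(query, self.conn)
```

The real helper may handle errors, context management, or caching differently; this only illustrates the three calls the notebook relies on.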
from src.scripts.sqlite_connector import DatabaseConnection
import os
# Get current working directory
cwd = os.getcwd()
# Initialize database connection
db_path = os.path.join(cwd, 'dataset', 'olist.db')
db = DatabaseConnection(db_path)
# Get all table names
tables = db.get_table_names()
print("Available tables:", tables)
# Preview the first few rows of each table
for table in tables:
    table_df = db.read_table(table)
    display(table_df.head())
Available tables: ['customers', 'geoloc', 'order_items', 'order_pymts', 'order_reviews', 'orders', 'products', 'sellers', 'translation']
| | index | customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state |
|---|---|---|---|---|---|---|
| 0 | 0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | 861eff4711a542e4b93843c6dd7febb0 | 14409 | franca | SP |
| 1 | 1 | 18955e83d337fd6b2def6b18a428ac77 | 290c77bc529b7ac935b93aa66c333dc3 | 9790 | sao bernardo do campo | SP |
| 2 | 2 | 4e7b3e00288586ebd08712fdd0374a03 | 060e732b5b29e8181a18229c7b0b2b5e | 1151 | sao paulo | SP |
| 3 | 3 | b2b6027bc5c5109e529d4dc6358b12c3 | 259dac757896d24d7702b9acbbff3f3c | 8775 | mogi das cruzes | SP |
| 4 | 4 | 4f2d8ab171c80ec8364f7c12e35b23ad | 345ecd01c38d18a9036ed96c73b8d066 | 13056 | campinas | SP |
| | index | geolocation_zip_code_prefix | geolocation_lat | geolocation_lng | geolocation_city | geolocation_state |
|---|---|---|---|---|---|---|
| 0 | 0 | 1037 | -23.545621 | -46.639292 | sao paulo | SP |
| 1 | 1 | 1046 | -23.546081 | -46.644820 | sao paulo | SP |
| 2 | 2 | 1046 | -23.546129 | -46.642951 | sao paulo | SP |
| 3 | 3 | 1041 | -23.544392 | -46.639499 | sao paulo | SP |
| 4 | 4 | 1035 | -23.541578 | -46.641607 | sao paulo | SP |
| | index | order_id | order_item_id | product_id | seller_id | shipping_limit_date | price | freight_value |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 1 | 4244733e06e7ecb4970a6e2683c13e61 | 48436dade18ac8b2bce089ec2a041202 | 2017-09-19 09:45:35 | 58.90 | 13.29 |
| 1 | 1 | 00018f77f2f0320c557190d7a144bdd3 | 1 | e5f2d52b802189ee658865ca93d83a8f | dd7ddc04e1b6c2c614352b383efe2d36 | 2017-05-03 11:05:13 | 239.90 | 19.93 |
| 2 | 2 | 000229ec398224ef6ca0657da4fc703e | 1 | c777355d18b72b67abbeef9df44fd0fd | 5b51032eddd242adc84c38acab88f23d | 2018-01-18 14:48:30 | 199.00 | 17.87 |
| 3 | 3 | 00024acbcdf0a6daa1e931b038114c75 | 1 | 7634da152a4610f1595efa32f14722fc | 9d7a1d34a5052409006425275ba1c2b4 | 2018-08-15 10:10:18 | 12.99 | 12.79 |
| 4 | 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 1 | ac6c3623068f30de03045865e4e10089 | df560393f3a51e74553ab94004ba5c87 | 2017-02-13 13:57:51 | 199.90 | 18.14 |
| | index | order_id | payment_sequential | payment_type | payment_installments | payment_value |
|---|---|---|---|---|---|---|
| 0 | 0 | b81ef226f3fe1789b1e8b2acac839d17 | 1 | credit_card | 8 | 99.33 |
| 1 | 1 | a9810da82917af2d9aefd1278f1dcfa0 | 1 | credit_card | 1 | 24.39 |
| 2 | 2 | 25e8ea4e93396b6fa0d3dd708e76c1bd | 1 | credit_card | 1 | 65.71 |
| 3 | 3 | ba78997921bbcdc1373bb41e913ab953 | 1 | credit_card | 8 | 107.78 |
| 4 | 4 | 42fdf880ba16b47b59251dd489d4441a | 1 | credit_card | 2 | 128.45 |
| | index | review_id | order_id | review_score | review_comment_title | review_comment_message | review_creation_date | review_answer_timestamp |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 7bc2406110b926393aa56f80a40eba40 | 73fc7af87114b39712e6da79b0a377eb | 4 | None | None | 2018-01-18 00:00:00 | 2018-01-18 21:46:59 |
| 1 | 1 | 80e641a11e56f04c1ad469d5645fdfde | a548910a1c6147796b98fdf73dbeba33 | 5 | None | None | 2018-03-10 00:00:00 | 2018-03-11 03:05:13 |
| 2 | 2 | 228ce5500dc1d8e020d8d1322874b6f0 | f9e4b658b201a9f2ecdecbb34bed034b | 5 | None | None | 2018-02-17 00:00:00 | 2018-02-18 14:36:24 |
| 3 | 3 | e64fb393e7b32834bb789ff8bb30750e | 658677c97b385a9be170737859d3511b | 5 | None | Recebi bem antes do prazo estipulado. | 2017-04-21 00:00:00 | 2017-04-21 22:02:06 |
| 4 | 4 | f7c4243c7fe1938f181bec41a392bdeb | 8e6bfb81e283fa7e4f11123a3fb894f1 | 5 | None | Parabéns lojas lannister adorei comprar pela I... | 2018-03-01 00:00:00 | 2018-03-02 10:26:53 |
| | index | order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-02 11:07:15 | 2017-10-04 19:55:00 | 2017-10-10 21:25:13 | 2017-10-18 00:00:00 |
| 1 | 1 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-07-26 03:24:27 | 2018-07-26 14:31:00 | 2018-08-07 15:27:45 | 2018-08-13 00:00:00 |
| 2 | 2 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-08-08 08:55:23 | 2018-08-08 13:50:00 | 2018-08-17 18:06:29 | 2018-09-04 00:00:00 |
| 3 | 3 | 949d5b44dbf5de918fe9c16f97b45f8a | f88197465ea7920adcdbec7375364d82 | delivered | 2017-11-18 19:28:06 | 2017-11-18 19:45:59 | 2017-11-22 13:39:59 | 2017-12-02 00:28:42 | 2017-12-15 00:00:00 |
| 4 | 4 | ad21c59c0840e6cb83a9ceb5573f8159 | 8ab97904e6daea8866dbdbc4fb7aad2c | delivered | 2018-02-13 21:18:39 | 2018-02-13 22:20:29 | 2018-02-14 19:46:34 | 2018-02-16 18:17:02 | 2018-02-26 00:00:00 |
| | index | product_id | product_category_name | product_name_lenght | product_description_lenght | product_photos_qty | product_weight_g | product_length_cm | product_height_cm | product_width_cm |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1e9e8ef04dbcff4541ed26657ea517e5 | perfumaria | 40.0 | 287.0 | 1.0 | 225.0 | 16.0 | 10.0 | 14.0 |
| 1 | 1 | 3aa071139cb16b67ca9e5dea641aaa2f | artes | 44.0 | 276.0 | 1.0 | 1000.0 | 30.0 | 18.0 | 20.0 |
| 2 | 2 | 96bd76ec8810374ed1b65e291975717f | esporte_lazer | 46.0 | 250.0 | 1.0 | 154.0 | 18.0 | 9.0 | 15.0 |
| 3 | 3 | cef67bcfe19066a932b7673e239eb23d | bebes | 27.0 | 261.0 | 1.0 | 371.0 | 26.0 | 4.0 | 26.0 |
| 4 | 4 | 9dc1a7de274444849c219cff195d0b71 | utilidades_domesticas | 37.0 | 402.0 | 4.0 | 625.0 | 20.0 | 17.0 | 13.0 |
| | index | seller_id | seller_zip_code_prefix | seller_city | seller_state |
|---|---|---|---|---|---|
| 0 | 0 | 3442f8959a84dea7ee197c632cb2df15 | 13023 | campinas | SP |
| 1 | 1 | d1b65fc7debc3361ea86b5f14c68d2e2 | 13844 | mogi guacu | SP |
| 2 | 2 | ce3ad9de960102d0677a81f5d0bb7b2d | 20031 | rio de janeiro | RJ |
| 3 | 3 | c0f3eea2e14555b6faeea3dd58c1b1c3 | 4195 | sao paulo | SP |
| 4 | 4 | 51a04a8a6bdcb23deccc82b0b80742cf | 12914 | braganca paulista | SP |
| | index | product_category_name | product_category_name_english |
|---|---|---|---|
| 0 | 0 | beleza_saude | health_beauty |
| 1 | 1 | informatica_acessorios | computers_accessories |
| 2 | 2 | automotivo | auto |
| 3 | 3 | cama_mesa_banho | bed_bath_table |
| 4 | 4 | moveis_decoracao | furniture_decor |
1.2 Data Overview¶
- Objective: Understand the contents of the database.
- Key Tables:
- Orders: Contains order history and delivery details.
- Customers: Includes customer demographic and geographical data.
- Products: Provides product details and categories.
- Reviews: Contains customer satisfaction reviews.
- Sellers: Includes seller identifiers and locations.
1.3 Query 1: Late Deliveries Analysis¶
- Objective: Identify recent orders (less than 3 months old) that were delivered with a delay of at least 3 days, excluding canceled orders.
- Insights:
- Helps assess delivery performance and identify areas for improvement.
query_late_deliveries = """
WITH latest_purchase_date AS (
    SELECT MAX(order_purchase_timestamp) AS latest_purchase
    FROM orders
    WHERE order_status != 'canceled'
),
delayed_deliveries AS (
    SELECT o.*
    FROM orders o
    CROSS JOIN latest_purchase_date l
    WHERE o.order_estimated_delivery_date IS NOT NULL
      AND o.order_delivered_customer_date IS NOT NULL
      AND o.order_status != 'canceled'
      -- Delivered more than 3 days after the estimated date
      AND o.order_delivered_customer_date > date(o.order_estimated_delivery_date, '+3 days')
      -- Purchased within the last 3 months of data
      AND o.order_purchase_timestamp BETWEEN date(l.latest_purchase, '-3 months')
                                         AND l.latest_purchase
)
SELECT * FROM delayed_deliveries;
"""
df_late_deliveries = db.execute_query(query_late_deliveries)
display(df_late_deliveries)
| | index | order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 561 | cfa4fa27b417971e86d8127cb688712f | 7093250e1741ebbed41f0cc552025fd6 | delivered | 2018-08-16 09:44:23 | 2018-08-17 03:10:22 | 2018-08-17 18:55:00 | 2018-08-29 01:41:41 | 2018-08-22 00:00:00 |
| 1 | 711 | 234c056c50619f48da64f731c48242b4 | 44e460a655f7154ccd9faa4dbbbaf68a | delivered | 2018-08-14 14:49:15 | 2018-08-15 03:15:15 | 2018-08-31 15:25:00 | 2018-09-01 18:14:42 | 2018-08-23 00:00:00 |
| 2 | 728 | 8ad3f1d0f96992e43566c4c82c9f6c58 | 948b29e24216a05fea13a18d8db45ea5 | delivered | 2018-07-17 21:25:29 | 2018-07-17 21:35:17 | 2018-07-18 13:08:00 | 2018-08-14 04:04:40 | 2018-08-03 00:00:00 |
| 3 | 1335 | 43f9a3d3e9d95525bfce717e31f72a56 | b003d09f32a12bcc00b6ca04d46554e6 | delivered | 2018-07-11 00:13:33 | 2018-07-12 03:10:16 | 2018-07-27 16:27:00 | 2018-08-04 16:15:46 | 2018-07-30 00:00:00 |
| 4 | 1844 | 7f579e203c931f3e8410103359c6d523 | d665be250d1c687c58fdea61a9b55a58 | delivered | 2018-08-02 18:09:27 | 2018-08-02 18:43:43 | 2018-08-10 14:15:00 | 2018-08-13 20:11:47 | 2018-08-09 00:00:00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 439 | 98368 | 874a7690bc049bd4ce210d195bdfff7b | f6b2c83ea477af36f44a1e2df7320036 | delivered | 2018-08-02 12:06:47 | 2018-08-02 13:05:56 | 2018-08-10 14:42:00 | 2018-08-13 15:50:48 | 2018-08-08 00:00:00 |
| 440 | 98382 | 587e32dd528769d669d539531d32aeb5 | 0af8a25fb0b6f833724bb5818275807b | delivered | 2018-08-10 11:46:09 | 2018-08-11 02:50:25 | 2018-08-14 10:09:00 | 2018-09-03 09:32:31 | 2018-08-28 00:00:00 |
| 441 | 98564 | 3adb141ba4bd69dd7fe8d3fb733c6b74 | c0539d5c87fc7c97a8418adffe4b45f0 | delivered | 2018-08-14 23:29:21 | 2018-08-16 03:05:11 | 2018-08-16 13:28:00 | 2018-08-28 18:02:52 | 2018-08-24 00:00:00 |
| 442 | 99279 | 4cf09d9e5ebbe0f91ddd7bf9aae891cd | 07b6b4fe5fefb948fc76b6d2bdba77d8 | delivered | 2018-07-19 08:37:26 | 2018-07-21 03:25:17 | 2018-07-23 15:31:00 | 2018-08-21 01:12:45 | 2018-08-10 00:00:00 |
| 443 | 99396 | b3112ca67f3afd4e20cf2ee91fc4f804 | 6f83c71b6c044fb156d697d4130fe9b5 | delivered | 2018-08-02 22:46:54 | 2018-08-02 23:04:06 | 2018-08-15 17:42:00 | 2018-08-21 00:03:26 | 2018-08-16 00:00:00 |
444 rows × 9 columns
To put the delayed orders in context, the next query pulls all completed deliveries from the last 6 months of data.
query_last_6months_deliveries = """
WITH latest_purchase_date AS (
    SELECT MAX(order_purchase_timestamp) AS latest_purchase
    FROM orders
    WHERE order_status != 'canceled'
),
recent_deliveries AS (
    SELECT o.*
    FROM orders o
    CROSS JOIN latest_purchase_date l
    WHERE o.order_estimated_delivery_date IS NOT NULL
      AND o.order_delivered_customer_date IS NOT NULL
      AND o.order_status != 'canceled'
      -- All deliveries purchased within the last 6 months of data
      AND o.order_purchase_timestamp BETWEEN date(l.latest_purchase, '-6 months')
                                         AND l.latest_purchase
)
SELECT * FROM recent_deliveries;
"""
df_deliveries_6months = db.execute_query(query_last_6months_deliveries)
display(df_deliveries_6months)
| | index | order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-07-26 03:24:27 | 2018-07-26 14:31:00 | 2018-08-07 15:27:45 | 2018-08-13 00:00:00 |
| 1 | 2 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-08-08 08:55:23 | 2018-08-08 13:50:00 | 2018-08-17 18:06:29 | 2018-09-04 00:00:00 |
| 2 | 12 | 82566a660a982b15fb86e904c8d32918 | d3e3b74c766bc6214e0c830b17ee2341 | delivered | 2018-06-07 10:06:19 | 2018-06-09 03:13:12 | 2018-06-11 13:29:00 | 2018-06-19 12:05:52 | 2018-07-18 00:00:00 |
| 3 | 13 | 5ff96c15d0b717ac6ad1f3d77225a350 | 19402a48fe860416adf93348aba37740 | delivered | 2018-07-25 17:44:10 | 2018-07-25 17:55:14 | 2018-07-26 13:16:00 | 2018-07-30 15:52:25 | 2018-08-08 00:00:00 |
| 4 | 15 | dcb36b511fcac050b97cd5c05de84dc3 | 3b6828a50ffe546942b7a473d70ac0fc | delivered | 2018-06-07 19:03:12 | 2018-06-12 23:31:02 | 2018-06-11 14:54:00 | 2018-06-21 15:34:32 | 2018-07-04 00:00:00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 38618 | 99425 | d692ef54145c9cb3322ec2e5508aa3f4 | 82ddfcf9438b0cd1117b55ac33184df8 | delivered | 2018-03-21 19:47:18 | 2018-03-21 20:05:26 | 2018-03-22 21:11:58 | 2018-04-11 00:48:31 | 2018-04-09 00:00:00 |
| 38619 | 99427 | 19dba6d1e58b04e7fc820558a38ea398 | cf97e9cf454480b303de6736ddd15fd6 | delivered | 2018-05-12 13:42:24 | 2018-05-12 13:55:17 | 2018-05-16 08:14:00 | 2018-05-18 17:18:29 | 2018-06-04 00:00:00 |
| 38620 | 99428 | c22a47117b6a87c967b0c278488110c1 | 6e153567dc52ea59c0498ef5383e9c7a | delivered | 2018-06-22 20:53:29 | 2018-06-22 21:17:45 | 2018-06-25 17:50:00 | 2018-07-03 19:48:54 | 2018-07-23 00:00:00 |
| 38621 | 99430 | c81f74e50f0496fa39716cc77cacd460 | 7b270ebc87c25c8404348c10ff80a80e | delivered | 2018-03-04 22:48:38 | 2018-03-04 23:09:22 | 2018-03-05 16:15:04 | 2018-03-13 18:54:32 | 2018-03-26 00:00:00 |
| 38622 | 99440 | 66dea50a8b16d9b4dee7af250b4be1a5 | edb027a75a1449115f6b43211ae02a24 | delivered | 2018-03-08 20:57:30 | 2018-03-09 11:20:28 | 2018-03-09 22:11:59 | 2018-03-16 13:08:30 | 2018-04-03 00:00:00 |
38623 rows × 9 columns
from src.scripts.visualize_late_deliveries import visualize_late_deliveries
# Create visualization for late deliveries
fig = visualize_late_deliveries(df_deliveries_6months)
fig.show()
1.4 Query 2: High Revenue Sellers¶
- Objective: Identify sellers who generated over 100,000 BRL in revenue from delivered orders.
- Insights:
- Highlights top-performing sellers for potential partnerships or incentives.
query_high_revenue = """
SELECT
    s.seller_id,
    CAST(SUM(oi.price + oi.freight_value) AS INTEGER) AS total_revenue
FROM sellers s
JOIN order_items oi ON s.seller_id = oi.seller_id
JOIN orders o ON oi.order_id = o.order_id
WHERE o.order_status = 'delivered'
GROUP BY s.seller_id
HAVING total_revenue > 100000
ORDER BY total_revenue DESC;
"""
df_high_revenue = db.execute_query(query_high_revenue)
display(df_high_revenue)
| | seller_id | total_revenue |
|---|---|---|
| 0 | 4869f7a5dfa277a7dca6462dcf3b52b2 | 247007 |
| 1 | 7c67e1448b00f6e969d365cea6b010ab | 237806 |
| 2 | 4a3ca9315b744ce9f8e9374361493884 | 231220 |
| 3 | 53243585a1d6dc2643021fd1853d8905 | 230797 |
| 4 | fa1c13f2614d7b5c4749cbc52fecda94 | 200833 |
| 5 | da8622b14eb17ae2831f4ac5b9dab84a | 184706 |
| 6 | 7e93a43ef30c4f03f38b393420bc753a | 171973 |
| 7 | 1025f0e2d44d7041d6cf58b6550e0bfa | 171924 |
| 8 | 7a67c85e85bb2ce8582c35f2203ad736 | 160278 |
| 9 | 955fee9216a65b617aa5c0531780ce60 | 156606 |
| 10 | 6560211a19b47992c3666cc44a7e94c0 | 148050 |
| 11 | 1f50f920176fa81dab994f9023523100 | 141712 |
| 12 | 46dc3b2cc0980fb8ec44634e21d2718e | 134162 |
| 13 | a1043bafd471dff536d0c462352beb48 | 130412 |
| 14 | 620c87c171fb2a6dd6e8bb4dec959fc6 | 126278 |
| 15 | cc419e0650a3c5ba77189a1882b7556a | 125936 |
| 16 | 5dceca129747e92ff8ef7a997dc4f8ca | 124702 |
| 17 | 7d13fca15225358621be4086e1eb0964 | 120934 |
| 18 | 3d871de0142ce09b7081e2b9d1733cb1 | 115515 |
from src.scripts.visualize_top_sellers import visualize_top_sellers
# Create visualization for high revenue sellers
fig = visualize_top_sellers(df_high_revenue)
fig.show()
1.5 Query 3: Engaged New Sellers¶
- Objective: Identify new sellers (active for less than 3 months) who have sold more than 30 products.
- Insights:
- Helps track the onboarding success of new sellers.
query_new_engaged_sellers = """
WITH seller_stats AS (
    SELECT
        s.seller_id,
        MIN(o.order_purchase_timestamp) AS first_sale,
        COUNT(DISTINCT oi.order_id) AS total_orders,
        COUNT(oi.product_id) AS total_products
    FROM sellers s
    JOIN order_items oi ON s.seller_id = oi.seller_id
    JOIN orders o ON oi.order_id = o.order_id
    GROUP BY s.seller_id
),
latest_purchase_date AS (
    SELECT MAX(order_purchase_timestamp) AS latest_purchase
    FROM orders
)
SELECT s.*
FROM seller_stats s
CROSS JOIN latest_purchase_date l
WHERE s.total_products > 30
  AND s.first_sale >= date(l.latest_purchase, '-3 months')
ORDER BY s.total_products DESC;
"""
df_new_engaged_sellers = db.execute_query(query_new_engaged_sellers)
display(df_new_engaged_sellers)
| | seller_id | first_sale | total_orders | total_products |
|---|---|---|---|---|
| 0 | d13e50eaa47b4cbe9eb81465865d8cfc | 2018-08-04 09:09:37 | 67 | 69 |
| 1 | 81f89e42267213cb94da7ddc301651da | 2018-08-08 12:45:12 | 46 | 52 |
| 2 | 240b9776d844d37535668549a396af32 | 2018-07-17 13:48:59 | 36 | 36 |
1.6 Query 4: Worst Reviewed Postal Codes¶
- Objective: Identify the top 5 postal codes with the worst average review scores (minimum 30 reviews) in the last 12 months.
- Insights:
- Pinpoints geographical areas with potential service or product quality issues.
query_worst_postcodes = """
WITH latest_purchase_date AS (
    SELECT MAX(order_purchase_timestamp) AS latest_purchase
    FROM orders
)
SELECT
    c.customer_zip_code_prefix,
    COUNT(r.review_id) AS review_count,
    ROUND(AVG(r.review_score), 2) AS avg_score
FROM customers c
CROSS JOIN latest_purchase_date l
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_reviews r ON o.order_id = r.order_id
WHERE r.review_creation_date >= date(l.latest_purchase, '-12 months')
GROUP BY c.customer_zip_code_prefix
HAVING review_count >= 30
ORDER BY avg_score ASC
LIMIT 5;
"""
df_worst_postcodes = db.execute_query(query_worst_postcodes)
display(df_worst_postcodes)
| | customer_zip_code_prefix | review_count | avg_score |
|---|---|---|---|
| 0 | 22753 | 48 | 2.83 |
| 1 | 22770 | 38 | 3.18 |
| 2 | 22793 | 93 | 3.25 |
| 3 | 13056 | 31 | 3.29 |
| 4 | 13295 | 32 | 3.31 |
For the map visualization, the same aggregation is run across all postal codes (minimum 1 review).
query_all_postcodes = """
WITH latest_purchase_date AS (
    SELECT MAX(order_purchase_timestamp) AS latest_purchase
    FROM orders
)
SELECT
    c.customer_zip_code_prefix,
    COUNT(r.review_id) AS review_count,
    ROUND(AVG(r.review_score), 2) AS avg_score
FROM customers c
CROSS JOIN latest_purchase_date l
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_reviews r ON o.order_id = r.order_id
WHERE r.review_creation_date >= date(l.latest_purchase, '-12 months')
GROUP BY c.customer_zip_code_prefix
HAVING review_count >= 1
ORDER BY avg_score ASC;
"""
df_all_postcodes = db.execute_query(query_all_postcodes)
from src.scripts.visualize_customer_review_bubble import create_brazil_postcode_map
# Create and display the map
fig = create_brazil_postcode_map(df_all_postcodes)
fig.show()
2. Data Extraction and Feature Engineering¶
2.1 RFM Features¶
- Objective: Calculate Recency, Frequency, and Monetary (RFM) metrics for each customer.
- Recency: Time since the last purchase.
- Frequency: Number of orders placed.
- Monetary: Total spending by the customer.
- Insights:
- RFM metrics are foundational for customer segmentation.
query_rfm = """
WITH last_purchase_date AS (
    SELECT MAX(order_purchase_timestamp) AS max_date
    FROM orders
    WHERE order_status = 'delivered'
)
SELECT
    c.customer_id,
    MAX(o.order_purchase_timestamp) AS order_purchase_timestamp,
    -- Recency: days since the customer's last purchase
    CAST(JULIANDAY(l.max_date) - JULIANDAY(MAX(o.order_purchase_timestamp)) AS INTEGER) AS recency_days,
    -- Frequency: number of distinct delivered orders
    COUNT(DISTINCT o.order_id) AS frequency,
    -- Monetary: total spend including freight
    CAST(SUM(oi.price + oi.freight_value) AS INTEGER) AS monetary
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
CROSS JOIN last_purchase_date l
WHERE o.order_status = 'delivered'
GROUP BY c.customer_id;
"""
df_rfm = db.execute_query(query_rfm)
display(df_rfm)
| | customer_id | order_purchase_timestamp | recency_days | frequency | monetary |
|---|---|---|---|---|---|
| 0 | 00012a2ce6f8dcda20d059ce98491703 | 2017-11-14 16:08:26 | 287 | 1 | 114 |
| 1 | 000161a058600d5901f007fab4c27140 | 2017-07-16 09:40:32 | 409 | 1 | 67 |
| 2 | 0001fd6190edaaf884bcaf3d49edf079 | 2017-02-28 11:06:43 | 547 | 1 | 195 |
| 3 | 0002414f95344307404f0ace7a26f1d5 | 2017-08-16 13:09:20 | 378 | 1 | 179 |
| 4 | 000379cdec625522490c315e70c7a9fb | 2018-04-02 13:42:17 | 149 | 1 | 107 |
| ... | ... | ... | ... | ... | ... |
| 96473 | fffcb937e9dd47a13f05ecb8290f4d3e | 2018-03-17 00:55:27 | 165 | 1 | 91 |
| 96474 | fffecc9f79fd8c764f843e9951b11341 | 2018-03-29 16:59:26 | 152 | 1 | 81 |
| 96475 | fffeda5b6d849fbd39689bb92087f431 | 2018-05-22 13:36:02 | 99 | 1 | 63 |
| 96476 | ffff42319e9b2d713724ae527742af25 | 2018-06-13 16:57:05 | 76 | 1 | 214 |
| 96477 | ffffa3172527f765de70084a7e53aae8 | 2017-09-02 11:53:32 | 361 | 2 | 45 |
96478 rows × 5 columns
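Since RFM metrics are the foundation for segmentation, one common follow-up (illustrative only, not part of this project's pipeline) is binning each metric into quartile scores. A toy sketch with hypothetical values:

```python
import pandas as pd

# Toy RFM table; the real one comes from query_rfm above
rfm = pd.DataFrame({
    "customer_id": ["a", "b", "c", "d"],
    "recency_days": [10, 400, 90, 250],
    "frequency": [3, 1, 2, 1],
    "monetary": [500, 40, 120, 80],
})

# Lower recency is better, so the lowest quartile gets the highest score
rfm["R"] = pd.qcut(rfm["recency_days"], 4, labels=[4, 3, 2, 1]).astype(int)
# Frequency and monetary are heavily tied in this dataset, so rank before binning
rfm["F"] = pd.qcut(rfm["frequency"].rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)
rfm["M"] = pd.qcut(rfm["monetary"].rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)
# Concatenate into the classic three-digit RFM code
rfm["RFM"] = rfm["R"].astype(str) + rfm["F"].astype(str) + rfm["M"].astype(str)
```

On the real data, where most customers have frequency 1, the rank-then-bin step (or `duplicates='drop'`) is what keeps `qcut` from failing on tied quantile edges.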
2.2 Customer Satisfaction Metrics¶
- Objective: Analyze customer satisfaction based on review data.
- Average Review Score: Overall satisfaction level.
- Review Count: Number of reviews submitted.
- Negative Reviews: Count of reviews with scores ≤ 2.
- Insights:
- Helps identify dissatisfied customers and areas for improvement.
query_satisfaction = """
WITH customers_with_negative AS (
    SELECT DISTINCT o.customer_id, 1 AS has_negative_flag
    FROM orders o
    JOIN order_reviews r ON o.order_id = r.order_id
    WHERE r.review_score <= 2
)
SELECT
    o.customer_id,
    CAST(AVG(r.review_score) AS INTEGER) AS avg_review_score,
    COUNT(r.review_id) AS review_count,
    COALESCE(cwn.has_negative_flag, 0) AS has_negative_flag
FROM orders o
JOIN order_reviews r ON o.order_id = r.order_id
LEFT JOIN customers_with_negative cwn ON o.customer_id = cwn.customer_id
GROUP BY o.customer_id;
"""
df_satisfaction = db.execute_query(query_satisfaction)
display(df_satisfaction)
| | customer_id | avg_review_score | review_count | has_negative_flag |
|---|---|---|---|---|
| 0 | 00012a2ce6f8dcda20d059ce98491703 | 1 | 1 | 1 |
| 1 | 000161a058600d5901f007fab4c27140 | 4 | 1 | 0 |
| 2 | 0001fd6190edaaf884bcaf3d49edf079 | 5 | 1 | 0 |
| 3 | 0002414f95344307404f0ace7a26f1d5 | 5 | 1 | 0 |
| 4 | 000379cdec625522490c315e70c7a9fb | 4 | 1 | 0 |
| ... | ... | ... | ... | ... |
| 98668 | fffecc9f79fd8c764f843e9951b11341 | 3 | 1 | 0 |
| 98669 | fffeda5b6d849fbd39689bb92087f431 | 4 | 1 | 0 |
| 98670 | ffff42319e9b2d713724ae527742af25 | 5 | 1 | 0 |
| 98671 | ffffa3172527f765de70084a7e53aae8 | 5 | 1 | 0 |
| 98672 | ffffe8b65bbe3087b653a978c870db99 | 5 | 1 | 0 |
98673 rows × 4 columns
2.3 Customer Behavior Patterns¶
- Objective: Extract behavioral insights from customer purchase data.
- Product Category Diversity: Number of unique product categories purchased.
- Average Delivery Time: Time taken for orders to be delivered.
- Unique Sellers: Number of sellers a customer has interacted with.
- Insights:
- Provides a deeper understanding of customer preferences and habits.
query_behavior = """
SELECT
    o.customer_id,
    COUNT(DISTINCT p.product_category_name) AS unique_categories,
    AVG(CAST(JULIANDAY(o.order_delivered_customer_date) -
             JULIANDAY(o.order_purchase_timestamp) AS INTEGER)) AS avg_delivery_time,
    COUNT(DISTINCT oi.seller_id) AS unique_sellers
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE o.order_status = 'delivered'
GROUP BY o.customer_id;
"""
df_behavior = db.execute_query(query_behavior)
display(df_behavior)
| | customer_id | unique_categories | avg_delivery_time | unique_sellers |
|---|---|---|---|---|
| 0 | 00012a2ce6f8dcda20d059ce98491703 | 1 | 13.0 | 1 |
| 1 | 000161a058600d5901f007fab4c27140 | 1 | 9.0 | 1 |
| 2 | 0001fd6190edaaf884bcaf3d49edf079 | 1 | 5.0 | 1 |
| 3 | 0002414f95344307404f0ace7a26f1d5 | 1 | 28.0 | 1 |
| 4 | 000379cdec625522490c315e70c7a9fb | 1 | 11.0 | 1 |
| ... | ... | ... | ... | ... |
| 96473 | fffcb937e9dd47a13f05ecb8290f4d3e | 1 | 4.0 | 1 |
| 96474 | fffecc9f79fd8c764f843e9951b11341 | 1 | 12.0 | 1 |
| 96475 | fffeda5b6d849fbd39689bb92087f431 | 1 | 17.0 | 1 |
| 96476 | ffff42319e9b2d713724ae527742af25 | 1 | 5.0 | 1 |
| 96477 | ffffa3172527f765de70084a7e53aae8 | 2 | 12.0 | 2 |
96478 rows × 4 columns
3. Feature Analysis and Transformation¶
3.1 Merge Features¶
- Objective: Combine RFM metrics, satisfaction metrics, and behavior patterns into a unified dataset for analysis.
- Outcome:
- A consolidated DataFrame ready for further analysis and modeling.
from src.classes.feature_analysis import FeatureAnalysis
from src.classes.feature_engineering import FeatureEngineering
# Initialize with existing query results
fe = FeatureEngineering(df_rfm, df_satisfaction, df_behavior)
features_df = fe.combine_features()
# Plot results
fa = FeatureAnalysis(features_df, columns_to_exclude=['customer_id', 'order_purchase_timestamp'])
dist_plot = fa.plot_distributions()
dist_plot.show()
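`FeatureEngineering.combine_features` is project code; the merge it performs presumably looks something like this sketch (hypothetical implementation, toy data):

```python
import pandas as pd

def combine_features(df_rfm: pd.DataFrame,
                     df_satisfaction: pd.DataFrame,
                     df_behavior: pd.DataFrame) -> pd.DataFrame:
    """Join the three per-customer result sets on customer_id."""
    merged = df_rfm.merge(df_satisfaction, on="customer_id", how="left")
    merged = merged.merge(df_behavior, on="customer_id", how="left")
    return merged

# Toy inputs mimicking the shapes of the three query results
rfm = pd.DataFrame({"customer_id": ["a", "b"], "monetary": [114, 67]})
sat = pd.DataFrame({"customer_id": ["a", "b"], "avg_review_score": [1, 4]})
beh = pd.DataFrame({"customer_id": ["a"], "unique_sellers": [1]})

features = combine_features(rfm, sat, beh)
```

A left join keeps every customer from the RFM base table even when a behavior or satisfaction row is missing; the real class may instead use inner joins or fill defaults.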
3.2 Feature Distribution Analysis¶
- Objective: Explore the distribution of features to identify patterns, outliers, and preprocessing needs.
- Steps:
- Statistical Summaries: Compute mean, median, standard deviation, etc.
- Distribution Plots: Visualize feature distributions using histograms or density plots.
- Outlier Detection: Use box plots to identify extreme values.
- Outcome:
- Informs decisions on scaling, normalization, and handling outliers.
from src.scripts.analyze_numerical_outliers import analyze_outliers_with_multiple_methods
# Analyze all numeric variables with different outlier detection methods
all_summaries, all_cleaned_dfs = analyze_outliers_with_multiple_methods(features_df)
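`analyze_outliers_with_multiple_methods` is project code that compares several detection methods; the Z-score (±3) variant used later can be sketched as (illustrative implementation, not the project's):

```python
import pandas as pd

def zscore_filter(df: pd.DataFrame, cols, threshold: float = 3.0) -> pd.DataFrame:
    """Keep rows whose values in `cols` all lie within `threshold` std devs of the mean."""
    z = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)
    return df[(z.abs() <= threshold).all(axis=1)]

# 20 typical customers plus one extreme spender
data = pd.DataFrame({"monetary": [100] * 20 + [5000]})
cleaned = zscore_filter(data, ["monetary"])
# The extreme row is dropped; the 20 typical rows remain
```

The real script also applies IQR-style rules; Z-score filtering assumes roughly symmetric distributions, which is why the skewness table below matters.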
3.3 Box Plot Analysis¶
- Objective: Visualize feature distributions and detect outliers.
- Steps:
- Plot box plots for numerical features.
- Highlight features with significant outliers.
- Outcome:
- Guides feature scaling and normalization.
from src.scripts.visualize_numerical_outliers import create_interactive_outlier_visualization
# Create the interactive outlier visualization
summary_df, df_cleaned = create_interactive_outlier_visualization(all_cleaned_dfs['Z-score (±3)'])
Data Summary:
| | Column | Skewness | Mean | Median | StdDev |
|---|---|---|---|---|---|
| 0 | recency_days | 0.431893 | 239.116453 | 220.0 | 152.830149 |
| 1 | frequency | 3.135869 | 1.078442 | 1.0 | 0.268868 |
| 2 | monetary | 2.327219 | 138.188126 | 103.0 | 119.647027 |
| 3 | avg_review_score | -1.481654 | 4.155585 | 5.0 | 1.284016 |
| 4 | review_count | 0.000000 | 1.000000 | 1.0 | 0.000000 |
| 5 | has_negative_flag | 2.220768 | 0.128464 | 0.0 | 0.334608 |
| 6 | unique_categories | 1.792656 | 1.063158 | 1.0 | 0.295683 |
| 7 | avg_delivery_time | 1.217369 | 11.374220 | 10.0 | 7.324842 |
| 8 | unique_sellers | 3.135869 | 1.078442 | 1.0 | 0.268868 |
3.4 RFM Dashboard Review After Outlier Removal¶
- Objective: Visualize RFM feature distributions after outlier removal.
# Create a hybrid dataframe with cleaned continuous variables but original flags/counts
hybrid_df = all_cleaned_dfs['Z-score (±3)'].copy()
# List of columns to restore from original data
categorical_cols = ['has_negative_flag']
count_cols = ['review_count', 'unique_categories', 'frequency', 'unique_sellers']
# Replace the values with originals
for col in categorical_cols + count_cols:
if col in features_df.columns and col in hybrid_df.columns:
hybrid_df[col] = features_df[col]
# Use the hybrid dataframe for visualization
fa = FeatureAnalysis(hybrid_df, columns_to_exclude=['customer_id'])
dist_plot = fa.plot_distributions()
dist_plot.show()
3.5 Feature Correlations¶
- Objective: Analyze relationships between features using a correlation matrix.
- Steps:
- Compute the correlation matrix for numerical features.
- Visualize the lower triangle of the matrix using a heatmap.
- Outcome:
- Identifies highly correlated features and potential redundancies.
# Plot results
corr_plot = fa.plot_correlation_matrix()
corr_plot.show()
4. Implement clustering algorithms¶
4.1 Feature Transformations for Clustering¶
The feature transformation process prepares customer data for clustering, balancing algorithm performance with interpretability:
Transformation Components¶
Data Preparation
- Set customer_id as the index to maintain customer identity throughout the analysis
- Removed temporal variables (order_purchase_timestamp) that aren't relevant for segmentation
- Created a clean foundation for applying consistent transformations
Specialized Transformation Strategy
- Implemented the GenericFeatureTransformer to handle different variable types appropriately
- Applied specific transformation methods to three distinct variable categories:
  - Numerical variables: robust scaling to minimize outlier impact
  - Categorical indicators: proper encoding of the has_negative_flag variable
  - Count variables: special handling to preserve the count nature while standardizing scale
Count Variable Preservation
- Recognized that count variables (review_count, unique_categories, frequency, unique_sellers) have special properties
- Implemented specialized transformations that maintain their inherent characteristics
- Ensured that zero counts remain meaningful in the transformed space
Reversible Transformations
- Stored transformation parameters to enable inverse transformation
- Created capability to convert clusters back to original feature space
- Ensured business stakeholders can understand results in familiar metrics
Outcome¶
The transformation process creates a dataset that:
- Places all variables on comparable scales for distance-based clustering algorithms
- Preserves the inherent structure and relationships between variables
- Minimizes the impact of outliers without losing their information
- Maintains the ability to interpret results in the original business context
This approach balances mathematical optimization with business interpretability, enabling both effective clustering and actionable insights from the resulting segments.
from src.classes.feature_transformation import GenericFeatureTransformer
#Set customer_id as index before transformation
hybrid_df = hybrid_df.set_index('customer_id')
hybrid_df = hybrid_df.drop(columns=['order_purchase_timestamp'], errors='ignore')
# Create and apply the generic transformer
transformer = GenericFeatureTransformer()
transformed_df = transformer.fit_transform(
df=hybrid_df,
categorical_cols=['has_negative_flag'],
count_cols=['review_count', 'unique_categories', 'frequency', 'unique_sellers']
)
# Get original data back when needed
original_df = transformer.inverse_transform(transformed_df)
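`GenericFeatureTransformer` is project code; the reversible-scaling idea it implements for numerical variables might look like this sketch (hypothetical class covering only median/IQR robust scaling):

```python
import pandas as pd

class ReversibleRobustScaler:
    """Minimal sketch: robust scaling (median/IQR) with stored parameters for inversion."""

    def fit(self, df: pd.DataFrame) -> "ReversibleRobustScaler":
        # Store the parameters so the transformation can be undone later
        self.center_ = df.median()
        self.scale_ = (df.quantile(0.75) - df.quantile(0.25)).replace(0, 1.0)
        return self

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        return (df - self.center_) / self.scale_

    def inverse_transform(self, scaled: pd.DataFrame) -> pd.DataFrame:
        return scaled * self.scale_ + self.center_

raw = pd.DataFrame({"monetary": [40.0, 100.0, 120.0, 500.0]})
scaler = ReversibleRobustScaler().fit(raw)
roundtrip = scaler.inverse_transform(scaler.transform(raw))
```

Median/IQR scaling resists outliers better than mean/std scaling, and keeping `center_` and `scale_` is exactly what allows cluster centroids to be mapped back to business units.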
4.2 Clustering Analysis with Advanced Feature Importance¶
The customer segmentation implementation uses a dual-model approach that combines PCA transformation with robust feature importance analysis:
Analysis Components¶
PCA Component Analysis
- Applied dimensionality reduction to visualize high-dimensional customer data
- Created explained variance plots to identify optimal components (87% variance captured in top 3 components)
- Generated biplots showing feature loadings on principal components for interpretability
- Identified which features contribute most to each principal component
Optimal Cluster Determination
- Implemented optimized elbow method with silhouette scoring
- Analyzed inertia and silhouette metrics to identify the ideal number of clusters
- Selected 3 clusters based on the balance between complexity and interpretability
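The elbow-plus-silhouette selection above can be illustrated with a small synthetic stand-in for the transformed customer features (the real analysis runs inside `ClusteringAnalysis`):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic data with 3 well-separated groups, standing in for customers
X = np.vstack([rng.normal(c, 0.5, size=(100, 4)) for c in (0, 5, 10)])

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                      # elbow criterion
    silhouettes[k] = silhouette_score(X, km.labels_)  # separation quality

# Inertia always decreases with k; silhouette peaks at the natural grouping
best_k = max(silhouettes, key=silhouettes.get)
```

Inertia alone cannot pick k (it decreases monotonically), which is why the silhouette score is used as the tie-breaker.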
Dual-Model K-Means Implementation
- Trained PCA-based model for visualization (using 3D PCA components)
- Trained feature-space model for interpretability (using original features)
- Maintained both models to maximize insights while enabling visualization
Feature Importance Analysis
- Applied permutation importance to identify cluster-defining features
- Used silhouette score reduction as the importance metric
- Visualized with error bars to show consistency of importance scores
- Applied sampling techniques for computational efficiency
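A compact sketch of permutation importance with silhouette reduction as the metric (synthetic data; the notebook's implementation adds repeats, error bars, and sampling). Permuting a feature that defines the clusters should sharply reduce the silhouette score, while permuting a noise feature should barely change it:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Feature 0 separates two groups; feature 1 is pure noise
X = np.column_stack([
    np.concatenate([rng.normal(0, 0.3, 150), rng.normal(5, 0.3, 150)]),
    rng.normal(0, 1, 300),
])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
base = silhouette_score(X, labels)

importances = []
for j in range(X.shape[1]):
    drops = []
    for _ in range(5):
        Xp = X.copy()
        rng.shuffle(Xp[:, j])  # permute one feature, keep labels fixed
        drops.append(base - silhouette_score(Xp, labels))
    importances.append(np.mean(drops))  # mean silhouette reduction
```

The spread of `drops` across repeats is what the error bars in the notebook's visualization convey.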
Outcome¶
The analysis reveals distinct customer segments with clear behavioral differences, enabling targeted marketing strategies based on:
- Recency, frequency, and monetary (RFM) profile of each segment
- Satisfaction levels and complaint patterns
- Product category diversity and seller engagement patterns
The implementation balances computation speed and insight depth, providing actionable customer segmentation with clear understanding of which features define each segment.
import importlib
import src.classes.cluster_analysis
importlib.reload(src.classes.cluster_analysis)
from src.classes.cluster_analysis import ClusteringAnalysis
# Initialize clustering with the transformer for inverse transform capability
ca = ClusteringAnalysis(
df=transformed_df,
transformer=transformer
)
# Generate elbow plot to determine optimal number of clusters
elbow_plot = ca.plot_elbow(range(2, 11))
elbow_plot.show()
Explained variance ratio of PCA components: [4.46357323e-01 2.84277328e-01 1.40936669e-01 9.65714903e-02 1.89670526e-02 1.15127137e-02 1.16503765e-03 2.12385035e-04 0.00000000e+00 0.00000000e+00]
Computing clusters: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:40<00:00, 4.45s/it]
# Visualize PCA explained variance to select optimal components
fig_pca_variance = ca.plot_pca_explained_variance(max_components=15)
fig_pca_variance.show()
# Create PCA biplot to visualize feature loadings on principal components
fig_biplot = ca.plot_pca_biplot(n_features=5)
fig_biplot.show()
# Visualize feature importance in principal components
fig_pca_importance = ca.plot_pca_feature_importance(n_components=4)
fig_pca_importance.show()
# Re-fit clustering with the chosen number of clusters
ca.fit_kmeans(n_clusters=4)
# Create and show silhouette visualization to evaluate cluster quality
silhouette_fig = ca.plot_silhouette_plotly()
silhouette_fig.show()
Sampling 20,000 records from 96478 for faster silhouette calculation...
# Create and display circle-based intercluster visualization
intercluster_fig = ca.plot_intercluster_distance_circles()
intercluster_fig.show()
Understanding the Clustering Approach¶
The clustering implementation uses a sophisticated dual-model approach:
PCA-based Clustering Model:
- Used for the elbow method and visualizations
- Applied to dimensionally-reduced data (PCA components)
- Advantages: Better visualization, handles correlated features
Original Feature Space Clustering Model:
- Used for feature importance analysis
- Applied directly to transformed features (not PCA)
- Advantages: Direct interpretability of feature effects
This dual approach explains why:
- The elbow plot may suggest different optimal clusters than silhouette analysis
- Feature importance relates to the original features, not PCA components
- We can both visualize clusters effectively and understand feature contributions
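The dual-model idea can be sketched with synthetic data and scikit-learn (the notebook's `ClusteringAnalysis` wraps this logic): one K-Means fitted in PCA space for visualization, one fitted in the original feature space for interpretability. On well-separated data the two partitions agree up to a relabeling, which `adjusted_rand_score` measures.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
# Synthetic stand-in for the transformed customer features (3 groups)
X = np.vstack([rng.normal(c, 0.4, size=(80, 6)) for c in (0, 4, 8)])

# Model A: clustering in PCA space (drives the 3D visualizations)
pca = PCA(n_components=3).fit(X)
km_pca = KMeans(n_clusters=3, n_init=10, random_state=0).fit(pca.transform(X))

# Model B: clustering in the original feature space (drives importance analysis)
km_feat = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Agreement between the two label sets, invariant to label permutation
agreement = adjusted_rand_score(km_pca.labels_, km_feat.labels_)
```

When the PCA components capture most of the variance (87% in the top 3 here), the two models tend to agree; divergence between them is itself a diagnostic worth checking.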
# Define your custom segment configuration - adjust thresholds based on business requirements
custom_segment_config = {
# Recency thresholds (in days)
'recency': {
'active': 45, # 0-45 days: "Active" (adjusted from default 30)
'recent': 120, # 46-120 days: "Recent" (adjusted from default 90)
# > 120 days: "Inactive"
},
# Frequency thresholds (number of orders)
'frequency': {
'frequent': 2.5, # > 2.5 orders: "Frequent" (adjusted from default 3)
'returning': 1.2, # > 1.2 orders: "Returning" (adjusted from default 1.5)
# <= 1.2 orders: "One-time"
},
# Monetary thresholds
'monetary': {
'high_value': 150, # > $150: "High-value" (fixed value instead of default 'mean')
# <= $150: "Standard-value"
},
# Satisfaction thresholds (review scores)
'satisfaction': {
'very_satisfied': 4.7, # >= 4.7: "Very Satisfied" (adjusted from default 4.5)
'satisfied': 4.2, # >= 4.2: "Satisfied" (adjusted from default 4.0)
'neutral': 3.5, # >= 3.5: "Neutral" (adjusted from default 3.0)
# < 3.5: "Unsatisfied"
}
}
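A hypothetical helper showing how thresholds like those above could be applied to label customers; the actual logic lives inside `create_sales_dashboard` and may differ in details such as boundary handling:

```python
# Minimal config mirroring the notebook's custom_segment_config
cfg = {
    "recency": {"active": 45, "recent": 120},
    "satisfaction": {"very_satisfied": 4.7, "satisfied": 4.2, "neutral": 3.5},
}

def label_recency(days: float, cfg: dict) -> str:
    # 0-45 days -> Active, 46-120 -> Recent, beyond -> Inactive
    if days <= cfg["recency"]["active"]:
        return "Active"
    if days <= cfg["recency"]["recent"]:
        return "Recent"
    return "Inactive"

def label_satisfaction(score: float, cfg: dict) -> str:
    # Thresholds are checked from most to least satisfied
    s = cfg["satisfaction"]
    if score >= s["very_satisfied"]:
        return "Very Satisfied"
    if score >= s["satisfied"]:
        return "Satisfied"
    if score >= s["neutral"]:
        return "Neutral"
    return "Unsatisfied"
```

Keeping the thresholds in a plain dict, as the notebook does, lets the business adjust segment definitions without touching the labeling code.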
from src.scripts.cluster_dashboard import create_sales_dashboard
import numpy as np
# Store the original dataframe in the transformer for reference
transformer.original_df = hybrid_df
# Get the cluster labels from the clustering model
labels = ca.kmeans_results['labels']
print(f"Found {len(labels)} labels across {len(np.unique(labels))} clusters")
# Create the dashboard with custom segment configuration
dashboard = create_sales_dashboard(labels, transformed_df, transformer, custom_segment_config)
# Display the components
for name, fig in dashboard.items():
fig.show()
Found 96478 labels across 4 clusters
5. Evaluate Segment Stability¶
- Objective: Assess the stability of customer segments over time.
- Steps:
- Validate feature consistency across time periods.
- Compute stability scores for each segment.
- Visualize stability trends.
- Outcome:
- Ensures segments remain meaningful and actionable over time.
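One way to quantify the stability described above is to re-cluster the same customers with features recomputed for two adjacent periods and compare the label assignments with the adjusted Rand index. This is a sketch on synthetic data, assuming mild feature drift between periods; the actual stability procedure in this project may differ:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(7)
# Same customers, features recomputed in two adjacent periods (small drift)
X_t1 = np.vstack([rng.normal(c, 0.4, size=(100, 4)) for c in (0, 5)])
X_t2 = X_t1 + rng.normal(0, 0.05, size=X_t1.shape)

labels_t1 = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_t1)
labels_t2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_t2)

# ARI near 1 means customers stayed in the same segments across periods
stability = adjusted_rand_score(labels_t1, labels_t2)
```

A stability score that degrades over successive periods signals that the segment definitions need refreshing before they are used in campaigns.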